Beyond twitter

Prototyping a data extraction pipeline for bluesky.social and exploring bluesky user activity for digital detection of influenza-like illness (ILI)

Heiner Atze

Digital Epidemiology 2025, Hasselt University

2025-04-10

Outline

  • The bluesky social network
  • Data accessibility via the bluesky API
  • Extraction and analysis of ILI-related bluesky messages

Introduction

bluesky: general aspects

  • microblogging platform
  • similar to twitter in user experience
  • decentralized
  • open source

Decentralization and Democratization of content algorithms 1

  • Decentralized User Identifier (DID)

    • immutable, associated with human readable user handle
  • Personal Data servers (PDS)

  • DIDs and associated content are portable between PDSs

  • Users can choose, prioritize and develop feed generators and content labelers

Development of user activity 1

  • current estimate: ca. 33 million active users
  • user base expanded in bursts after key events:
    • 2022: acquisition of twitter by Elon Musk
    • 2024: ban of X in Brazil, presidential election in the US

Literature addressing bluesky

  • Google Scholar search: “bluesky” AND “social”, since 2022

  • 43 articles

  • main topics:

    • decentralized social network architecture
    • user migration from X to bluesky 2024
    • network structure and dynamics
  • no results for

    • “bluesky” AND “disease”
    • “bluesky” AND “epidemiology”

Exploration of bluesky data

bluesky API

  • publicly accessible for free
  • extensive documentation at https://docs.bsky.app/docs/category/http-reference

searchPosts API method

  • API documentation

  • selected parameters:

    • q: search query
    • since, until: start and end of the search period
    • limit: max. 100 posts per request

  • deterministic search

  • allows exhaustive sampling
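
A minimal sketch of how exhaustive sampling with searchPosts could look, assuming the response carries a `posts` list and a `cursor` for the next page. The host in `BASE_URL` and the helper names `build_search_url` and `iter_posts` are illustrative assumptions, and the endpoint may require an authenticated session. The HTTP call is passed in as `fetch_json`, so the pagination logic can be tried without network access.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode

# Assumption: public AppView host; check the API documentation for the
# host to use and for authentication requirements.
BASE_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"


def build_search_url(q: str, since: str, until: str,
                     limit: int = 100, cursor: Optional[str] = None) -> str:
    """Build the URL for one page of search results."""
    params = {"q": q, "since": since, "until": until, "limit": limit}
    if cursor is not None:
        params["cursor"] = cursor
    return f"{BASE_URL}?{urlencode(params)}"


def iter_posts(fetch_json: Callable[[str], dict], q: str,
               since: str, until: str) -> Iterator[dict]:
    """Yield all posts in the search window by following the cursor."""
    cursor = None
    while True:
        page = fetch_json(build_search_url(q, since, until, cursor=cursor))
        yield from page.get("posts", [])
        cursor = page.get("cursor")
        if not cursor:  # no cursor means the last page was reached
            break
```

In practice `fetch_json` would wrap an HTTP client of choice; splitting the search period into short windows (e.g. one day) keeps each paginated run small.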

Post metadata

  • defined in the SDK documentation

  • fields (selection):

    • uri: unique post identifier
    • author: contains the DID, which allows retrieval of the user profile
    • record: contains the text and time information of the message
      • langs: language(s) detected by the bluesky server
    • embedded: any embedded media (images, other posts, etc.)
  • unlike former twitter post metadata, no geolocation information
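
A small sketch of how the fields listed above could be flattened into one row per post. The helper name `flatten_post` and the example post are hypothetical; the key names (`uri`, `author.did`, `record.text`, `record.createdAt`, `record.langs`) are assumed to match the post objects returned by the API.

```python
def flatten_post(post: dict) -> dict:
    """Reduce a post object to the fields used for counting and filtering."""
    record = post.get("record", {})
    return {
        "uri": post.get("uri"),
        "author_did": post.get("author", {}).get("did"),
        "text": record.get("text"),
        "created_at": record.get("createdAt"),
        "langs": record.get("langs", []),
    }


# Hypothetical post, shaped like one element of a search response.
example_post = {
    "uri": "at://did:plc:example/app.bsky.feed.post/abc123",
    "author": {"did": "did:plc:example", "handle": "example.bsky.social"},
    "record": {
        "text": "J'ai la grippe",
        "createdAt": "2024-01-15T08:30:00.000Z",
        "langs": ["fr"],
    },
}

row = flatten_post(example_post)
```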

User information

  • Feedgens

  • Labelers

  • no geo information

getProfiles API endpoint

  • allows retrieval of the author profile information
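
A sketch of how author DIDs collected from posts could be grouped into getProfiles requests. The host, the batch size of 25 actors per request, and the helper names are assumptions to be checked against the API documentation; `actors` is passed as a repeated query parameter.

```python
from typing import Iterator, List, Sequence
from urllib.parse import urlencode

# Assumptions: public AppView host and a limit of 25 actors per request.
PROFILES_URL = "https://public.api.bsky.app/xrpc/app.bsky.actor.getProfiles"
MAX_ACTORS = 25


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split a sequence into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def build_profile_urls(dids: Sequence[str]) -> List[str]:
    """Build one request URL per batch of unique DIDs."""
    unique = list(dict.fromkeys(dids))  # drop duplicates, keep order
    return [
        f"{PROFILES_URL}?{urlencode([('actors', did) for did in batch])}"
        for batch in batched(unique, MAX_ACTORS)
    ]
```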

Project

Outline

bluesky post data for digital disease surveillance

Implementation of a continuous surveillance pipeline

Data extraction

Basal network activity

  • Keywords:
    • travail (work)
    • demain (tomorrow)
    • voiture (car)
    • sommeil (sleep)
  • post counts aggregated by day
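
A minimal sketch of the daily aggregation, assuming each post carries an ISO 8601 timestamp with a UTC offset or a trailing `Z` (as in the `createdAt` field). The function name `daily_counts` is illustrative.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable


def daily_counts(timestamps: Iterable[str]) -> Dict[str, int]:
    """Count posts per UTC calendar day."""
    counts: Counter = Counter()
    for ts in timestamps:
        # Older Python versions do not parse a trailing "Z", so rewrite it.
        dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        day = dt.astimezone(timezone.utc).date().isoformat()
        counts[day] += 1
    return dict(sorted(counts.items()))
```

Running this once per keyword gives one daily time series per control or ILI term, which can then be summed to weekly counts.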

Case data

  • data downloaded from WHO FluMart
    • FluID: centralized epidemiological surveillance data

Results

Post count time series

Raw post counts

Data analysis starting from 2023-08-01

Keyword posts vs. ILI incidence

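|               | ILI posts | Control posts | ILI incidence |
|---------------|-----------|---------------|---------------|
| ILI posts     | 1.000     | 0.863         | 0.773         |
| Control posts | 0.863     | 1.000         | 0.541         |
| ILI incidence | 0.773     | 0.541         | 1.000         |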
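
The slides do not state which correlation coefficient was used. Assuming Pearson's r on the weekly series, a pairwise coefficient could be computed as in this sketch (the function name `pearson` is illustrative):

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```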

Machine Learning

Features

  • weekly no. of control posts
  • weekly no. of posts containing ILI-related keywords
  • normalized ILI related post counts
  • time and seasonal features
    • year
    • month
    • week
    • season
  • lag terms

Model structure

\(Y_w:\) ILI Incidence in week \(w\)

\(X_w:\) Input features obtained in week \(w\)

\[Y_{w+1} = f(X_w, X_{w-1}, X_{w-2})\]
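
A sketch of how the training rows for this structure could be assembled: each row concatenates \(X_w, X_{w-1}, X_{w-2}\) and is paired with the target \(Y_{w+1}\). The function name `make_supervised` and the list-of-lists layout are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def make_supervised(
    X: Sequence[Sequence[float]], y: Sequence[float], n_lags: int = 3
) -> Tuple[List[List[float]], List[float]]:
    """Pair lagged features (X_w, X_{w-1}, ...) with the next week's target."""
    rows: List[List[float]] = []
    targets: List[float] = []
    # w needs n_lags - 1 earlier weeks and one later week for the target.
    for w in range(n_lags - 1, len(X) - 1):
        row: List[float] = []
        for lag in range(n_lags):
            row.extend(X[w - lag])
        rows.append(row)
        targets.append(y[w + 1])
    return rows, targets
```

With five weeks of data and three lags, only two rows can be built, since the first two weeks lack a full history and the last week has no observed target yet.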

Gradient boosted trees

  • Sequential learning of weak learners
  • Each new learner corrects the errors of the previous ones
  • Combines predictions as a weighted sum of the learners
  • Robust to outliers
  • Handles non-linear relationships
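
To make the idea concrete, here is a toy gradient boosting regressor for squared loss built from decision stumps in plain Python. It is a didactic sketch, not the model used in the project (which library was used is not stated here); all function names are illustrative.

```python
from typing import List, Sequence, Tuple

# (feature index, threshold, value if <= threshold, value if > threshold)
Stump = Tuple[int, float, float, float]


def fit_stump(X: Sequence[Sequence[float]], residuals: Sequence[float]) -> Stump:
    """Find the single split that minimizes squared error on the residuals."""
    mean_all = sum(residuals) / len(residuals)
    best: Stump = (0, float("inf"), mean_all, mean_all)  # fallback: no split
    best_err = sum((r - mean_all) ** 2 for r in residuals)
    for j in range(len(X[0])):
        # Skip the largest value so the right side is never empty.
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if err < best_err:
                best, best_err = (j, thr, lmean, rmean), err
    return best


def predict_stump(stump: Stump, row: Sequence[float]) -> float:
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value


def fit_gbt(X, y, n_rounds: int = 50, learning_rate: float = 0.1):
    """Fit stumps sequentially, each one on the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps: List[Stump] = []
    for _ in range(n_rounds):
        # For squared loss the negative gradient is the residual.
        residuals = [t - p for t, p in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * predict_stump(stump, row)
                for p, row in zip(pred, X)]
    return base, stumps, learning_rate


def predict_gbt(model, row: Sequence[float]) -> float:
    """Base value plus the weighted sum of all stump predictions."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)
```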

Gradient boosting 1

Model evaluation

  • Time series split validation
    • retains temporal information
    • mimics continuous data acquisition
  • Setup:
    • initial training: 40 weeks
    • test set size: 1 week
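
A sketch of the expanding-window split with the setup above (40 initial training weeks, 1 test week per run). The generator name `expanding_window_splits` is illustrative; weeks are assumed to be indexed in chronological order.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_samples: int, initial_train: int = 40, test_size: int = 1
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train indices, test indices); the training window grows each run."""
    train_end = initial_train
    while train_end + test_size <= n_samples:
        train_idx = list(range(train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        train_end += test_size
```

Each run trains only on weeks that precede the test week, so no future information leaks into the model.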

Expanding window time series validation 1

Validation results

Predictions and metrics

  • Target variable: weekly ILI incidence one week ahead \(Y_{w+1}\)

Metrics

| Dataset      | MAE*        | RMSE        |
|--------------|-------------|-------------|
| Training     | \(23.96\)   | \(33.93\)   |
| Validation + | \(56.54\)   | \(56.54\)   |

* Mean absolute error, incidence per 100,000

+ mean over all validation runs
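
For reference, a sketch of the two metrics. With a test set of a single week, the RMSE of one run reduces to the absolute error, so averaging per-run MAE and per-run RMSE gives the same number. This would be consistent with the identical validation values, although how the averages were computed is an assumption here.

```python
from math import sqrt
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Single-week test fold (hypothetical values): both metrics coincide.
fold_mae = mae([10.0], [7.0])
fold_rmse = rmse([10.0], [7.0])
```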

Permutation importance of the current model

  • model agnostic feature importance procedure

  • random shuffling of single input features
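
A model-agnostic sketch of the procedure: shuffle one feature column at a time and record how much the error increases. The function names, the use of MAE as the score, and the number of repeats are illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence


def mean_abs_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(
    predict: Callable[[Sequence[float]], float],
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    n_repeats: int = 10,
    seed: int = 0,
) -> List[float]:
    """Mean increase in MAE after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = mean_abs_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [list(row[:j]) + [value] + list(row[j + 1:])
                      for row, value in zip(X, column)]
            score = mean_abs_error(y, [predict(row) for row in X_perm])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature the model ignores gets an importance of zero, while shuffling a feature the model relies on increases the error.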

Can “AI” help?

Idea

  • Filter posts using a large language model (LLM)

How?

  • provide case definition in the system prompt
  • use the JSON structured output option for convenient data processing

Prompt and output

Prompt extract

Analyze the following tweet-like message to determine if it describes the user's own influenza-like illness (ILI). ILI is defined by:

- Fever ≥38°C (100°F) **AND**

- At least one respiratory symptom (cough or sore throat) **PLUS**

- Additional systemic symptoms (headache, muscle aches, chills, fatigue, nasal congestion)

...



{ ... bluesky message dynamically inserted here ... }
// symptom extraction schema
{
    "ili_related": {
        "type": "boolean"
    },
    "symptoms": {
        "type": "array",
        "items": {
            "type": "string"
        }
    }
}
// symptom extraction example
{
    "ili_related": true,
    "symptoms": ["fever", "headache"]
}
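
A sketch of how the structured output could be checked before it is stored, assuming the model returns a JSON object with a boolean `ili_related` and a list of symptom strings. The function name `parse_annotation` is hypothetical, and the call to the LLM itself is left out.

```python
import json


def parse_annotation(raw: str) -> dict:
    """Parse and validate one LLM annotation returned as a JSON string."""
    data = json.loads(raw)
    ili_related = data.get("ili_related")
    symptoms = data.get("symptoms", [])
    if not isinstance(ili_related, bool):
        raise ValueError("ili_related must be true or false")
    if not isinstance(symptoms, list) or not all(isinstance(s, str) for s in symptoms):
        raise ValueError("symptoms must be a list of strings")
    # Normalize spelling so symptom counts can be aggregated later.
    return {
        "ili_related": ili_related,
        "symptoms": [s.strip().lower() for s in symptoms],
    }
```

Rejecting malformed answers at this step keeps the annotated post counts from being silently distorted by unparseable output.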

Extraction using the Google Gemini API

Examples

ILI positive

Original message

Bon bah ça, c'est fait. En arrêt maladie jusque Lundi, pour état grippal.
Pas le Covid, ni la grippe, mais des migraines, courbatures, impression de pas avoir dormi depuis 15 jours.
Je vais peut-être pouvoir finir mon doc sur Boris Vian sur Arte, et finir mes séries.
Et vous faire chier ici 😈

Machine translation

Well, that's it. On sick leave until Monday, for influenza.
Not the covid, nor the flu, but migraines, aches, impression of not having slept for 15 days.
I may be able to finish my doc on Boris Vian on Arte, and finish my series.
And piss you here 😈

LLM symptoms

migraines,courbatures

ILI negative

Original message

Grippe aviaire : les coupes budgétaires de Trump amplifient la menace pandémique www.lepoint.fr/tiny/1-2586310 #Santé via @lepoint.fr

Machine translation

Aviary flu: Trump's budget cuts amplify the pandemic threat www.lepoint.fr/tiny/1-2586310 #health via @lepoint.fr

LLM symptoms

nan

LLM annotated post counts

Correlation

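|               | LLM ILI posts | ILI incidence | Control posts |
|---------------|---------------|---------------|---------------|
| LLM ILI posts | 1.000         | 0.793         | 0.812         |
| ILI incidence | 0.793         | 1.000         | 0.557         |
| Control posts | 0.812         | 0.557         | 1.000         |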

Predictions and metrics

  • Target variable: weekly ILI incidence one week ahead \(Y_{w+1}\)

Metrics

| Dataset    | MAE* (LLM filtered) |
|------------|---------------------|
| Training   | \(26.64\)           |
| Validation | \(58.87\)           |

* Mean absolute error, incidence per 100,000

Permutation importance, LLM filtered posts

Conclusion

  • bluesky = promising data source
  • more data needed = patience

Outlook

  • investigate impact of LLM filtering on model performance

  • modeling of weekly ILI incidence based on message content

  • continuous data acquisition pipeline (WIP)

  • User localization based on profile

  • monitoring of bursts in user activity crucial

  • repeating the analysis for another country (e.g. Germany)

Pipeline (WIP)

graph LR
    subgraph kestra 
        dlt(dlt) --- posts
        llm --- bqstaging
        llm -- annotation --> bqstaging
        posts --> bqstaging[<b>GBQ</b> \n stage area \n 1 table per kw]
        dlt -- housekeeping --> count
        dlt -- case data --> who_tables
        dlt -- case data --> cdc_tables
        subgraph BigQuery data lake
          bqstaging
          who_tables
          cdc_tables
          count[post counts table]
        end
        bqstaging --- dbt
        who_tables --- dbt
        cdc_tables --- dbt
        count --- dbt
        dbt --> bq[Google \n BigQuery]
        subgraph BigQuery data warehouse
          bq
        end
    end

    bsky[bsky API] --> dlt
    WHO --> dlt
    CDC --> dlt
    bq --> looker[Looker studio \n dashboard]
    bq -- python --> stat1[Statistical analysis]
    bq -- python --> stat2[Machine learning, modeling]

Open source implementation

Data ingestion

  • python
  • data load tool (dlt)

Data storage and SQL

  • Google BigQuery

Data modeling

  • data build tool (dbt)

Workflow orchestration

  • kestra

available at: https://github.com/kantundpeterpan/bluesky_ddd_influenza

Thank you for your attention.

References

Balduf, Leonhard, Saidu Sokoto, Onur Ascigil, Gareth Tyson, Björn Scheuermann, Maciej Korczyński, Ignacio Castro, and Michał Król. 2024. “Looking at the Blue Skies of Bluesky.” In Proceedings of the 2024 ACM on Internet Measurement Conference, 76–91.
Duarte, Fabio. “Bluesky User Age, Gender, & Demographics (2025).” https://explodingtopics.com/blog/bluesky-users.
“How to Apply Stacking Cross Validation for Time-Series Data?” datascience.stackexchange.com. https://datascience.stackexchange.com/questions/41378/how-to-apply-stacking-cross-validation-for-time-series-data.
pythongeeks.org. “Gradient Boosting Algorithm in Machine Learning - Python Geeks.” https://pythongeeks.org/gradient-boosting-algorithm-in-machine-learning/.
Signorini, Alessio, Alberto Maria Segre, and Philip M. Polgreen. 2011. “The Use of Twitter to Track Levels of Disease Activity and Public Concern in the U.S. During the Influenza A H1N1 Pandemic.” PLOS ONE 6 (5): 1–10. https://doi.org/10.1371/journal.pone.0019467.